MATHEMATICAL ENGINEERING TECHNICAL REPORTS An Asymptotically Optimal Policy for Finite Support Models in the Multiarmed Bandit Problem
Abstract
We propose the minimum empirical divergence (MED) policy for the multiarmed bandit problem and prove that it is asymptotically optimal for finite support models. In this setting, Burnetas and Katehakis [3] have already proposed an asymptotically optimal policy. To choose an arm, our policy uses a criterion that is dual to the quantity used in [3]. Our criterion is easily computed by a convex optimization technique, which is an advantage in practical implementation. We confirm by simulations that the MED policy performs well in finite time in comparison with other currently popular policies.
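The abstract does not state the criterion itself, so the following is only a minimal sketch of what "computed by a convex optimization technique" could look like. It assumes rewards supported on a finite subset of [0, 1] and assumes the dual form max over 0 <= nu <= 1/(1 - mu) of E_F[log(1 - (X - mu) nu)], as used in the related bounded-support literature; the report may define the quantity differently. The function name `d_min` and its parameters are hypothetical.

```python
import math


def d_min(support, probs, mu, iters=200):
    """Sketch of a minimum-divergence criterion via its assumed dual form.

    Assumption (not taken from the abstract): rewards lie in [0, 1] and
        D_min(F, mu) = max_{0 <= nu <= 1/(1 - mu)} E_F[log(1 - (X - mu) * nu)].
    The objective is concave in nu, so a ternary search finds its maximum.
    """
    if not 0.0 <= mu < 1.0:
        raise ValueError("this sketch requires 0 <= mu < 1")

    def objective(nu):
        total = 0.0
        for x, p in zip(support, probs):
            if p == 0.0:
                continue
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                # log is undefined here; treat the point as infeasible
                return -math.inf
            total += p * math.log(arg)
        return total

    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1  # maximum lies to the right of m1
        else:
            hi = m2  # maximum lies to the left of m2
    # The objective is 0 at nu = 0, so the maximum is never negative.
    return max(0.0, objective(0.5 * (lo + hi)))


if __name__ == "__main__":
    # Empirical distribution with mass 0.7 at reward 0 and 0.3 at reward 1,
    # compared against a target mean of 0.5.
    print(d_min([0.0, 1.0], [0.7, 0.3], 0.5))
```

For a two-point distribution on {0, 1}, this dual value coincides with the Bernoulli Kullback-Leibler divergence between the empirical mean and mu, which gives a simple sanity check. A one-dimensional search is used here only to keep the sketch dependency-free; any concave maximization routine would serve.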
Similar resources
MATHEMATICAL ENGINEERING TECHNICAL REPORTS Finite-time Regret Bound of a Bandit Algorithm for the Semi-bounded Support Model
In this paper we consider stochastic multiarmed bandit problems. Recently, a policy called DMED was proposed and proved to achieve the asymptotic bound for the model in which each reward distribution is supported in a known bounded interval, e.g. [0, 1]. However, the derived regret bound is stated in asymptotic form, and the performance in finite time has been unknown. We inspect this policy and deri...
An Asymptotically Optimal Bandit Algorithm for Bounded Support Models
The multiarmed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning. The problem is modeled as a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0, 1]. In this model, Auer et al. (2002) proposed practical...
On the efficiency of Bayesian bandit algorithms from a frequentist point of view
In this contribution, we argue that algorithms derived from the Bayesian modelling of the multiarmed bandit problem are also optimal when evaluated using the frequentist cumulated regret as a measure of performance. We first show that the classical Gittins argument can be applied to convert the finite-horizon Bayesian multiarmed bandit problem into an MDP planning task that is numerically solva...
Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching - Automatic Control, IEEE Transactions on
We consider multiarmed bandit problems with switching cost, define uniformly good allocation rules, and restrict attention to such rules. We present a lower bound on the asymptotic performance of uniformly good allocation rules and construct an allocation scheme that achieves the bound. We discover that despite the inclusion of a switching cost the proposed allocation scheme achieves the same a...
Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management
Consider the Markov decision problems (MDPs) arising in the areas of intelligence, surveillance, and reconnaissance in which one selects among different targets for observation so as to track their position and classify them from noisy data [9], [10]; medicine in which one selects among different regimens to treat a patient [1]; and computer network security in which one selects different compu...